Create Dataset

ML

1. Feature Selection

246 variables have more than 800 null values (some Amenities are unique to individual listings); another 48 variables have between 750 and 799 null values.
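The null-count screening above could be sketched roughly as follows. The toy frame, its column values, the bucket edges and the `max_nulls` threshold are all assumptions made for illustration; they are scaled down from the 750/800 cut-offs mentioned in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real listings data.
df = pd.DataFrame({
    "Rating": [4.9, 4.8, np.nan, 5.0, 4.7],
    "Amenities_hot_tub": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "Amenities_washer": [1.0, np.nan, np.nan, np.nan, 1.0],
})

# Number of missing values per column.
null_counts = df.isna().sum()

# Group columns into null-count buckets (edges scaled down for the toy data).
bins = [-1, 0, 2, 3, len(df)]
labels = ["0", "1-2", "3", "4+"]
buckets = pd.cut(null_counts, bins=bins, labels=labels)
print(buckets.value_counts().sort_index())

# Keep only the columns whose null count does not exceed the chosen threshold.
max_nulls = 3  # assumed threshold for this sketch
kept = df.loc[:, null_counts <= max_nulls]
print(list(kept.columns))
```

On the real data the same pattern would apply with the bucket edges set around 750 and 800.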

2. Data Cleaning

3. EDA

The heat map maps the intensity of the data onto a color gradient. Red marks places that have more houses listed on Airbnb.

Interpretation: The ratings distribution shows that ratings of 4.9 to 5.0 are noticeably more frequent than the rest. As expected, California has the highest average price and the largest price range, for both the original and the discounted price. There is little correlation among the three variables, apart from the correlation between discount and original price.

Interpretation:

Plotting the Percentage Discount a property offers against its Rating shows no significant linear relationship between the two variables.

Compare ratings for properties by the availability of each Amenity to check the impact of amenities on ratings.

For listings with these Amenities (washer, dryer, mountain view, tv standard cable, indoor fireplace and hot tub), the average rating is 0.1 higher.
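One way to run this kind of comparison is to take, for each amenity flag, the mean rating of listings with the amenity minus the mean rating of listings without it. The sketch below uses made-up data and assumes the amenity columns are 0/1 indicators prefixed with `Amenities_`.

```python
import pandas as pd

# Hypothetical toy data: 1 = amenity present, 0 = absent.
df = pd.DataFrame({
    "Rating": [4.9, 4.7, 5.0, 4.6, 4.8, 4.8],
    "Amenities_washer": [1, 0, 1, 0, 1, 0],
    "Amenities_hot_tub": [1, 1, 0, 0, 1, 0],
})

amenity_cols = [c for c in df.columns if c.startswith("Amenities_")]

# Mean rating with the amenity minus mean rating without it.
diffs = {}
for col in amenity_cols:
    means = df.groupby(col)["Rating"].mean()
    diffs[col] = means.loc[1] - means.loc[0]

rating_gap = pd.Series(diffs).sort_values(ascending=False)
print(rating_gap.round(3))
```

A raw difference in means does not control for other listing characteristics, so it is best read as a descriptive check rather than a causal effect.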

3.1 Distribution of features

3.2 Correlation of features

All the Service_Ratings have a high positive correlation with the overall Rating. Apart from these per-aspect ratings, Superhost is also positively correlated with Rating (0.397424). Amenities_washer and Amenities_dryer, in contrast, are negatively correlated with Rating.
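A ranking like the one above can be read off a Pearson correlation matrix. The following is a minimal sketch on invented numbers; only the column names are borrowed from this notebook, and the resulting values say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data reusing a few of the notebook's column names.
df = pd.DataFrame({
    "Rating": [4.9, 4.7, 5.0, 4.6, 4.8],
    "Service_Ratings_cleanliness": [4.9, 4.6, 5.0, 4.5, 4.8],
    "Superhost": [1, 0, 1, 0, 0],
    "Amenities_washer": [0, 1, 0, 1, 1],
})

# Pearson correlation of every column with the target, strongest positive first.
corr_with_rating = (
    df.corr()["Rating"]
    .drop("Rating")
    .sort_values(ascending=False)
)
print(corr_with_rating.round(3))
```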

4. Data Processing

4.1 Normalize Data and One-hot encoding

A log transformation can help make skewed data more normally distributed, which mainly benefits the linear models; tree-based models are largely insensitive to monotonic transformations of the input features.
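A minimal sketch of the two preprocessing steps in this section, assuming a skewed price column, a count column and one categorical column. The `State` column name and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with right-skewed numeric columns and one categorical column.
df = pd.DataFrame({
    "Original Price": [80.0, 120.0, 200.0, 1500.0],
    "Reviews": [0, 12, 250, 40],
    "State": ["CA", "OR", "WA", "CA"],
})

# log1p computes log(1 + x): it compresses the long right tail and is defined at x = 0.
for col in ["Original Price", "Reviews"]:
    df[f"log_{col}"] = np.log1p(df[col])

# One-hot encode the categorical column into one indicator column per category.
encoded = pd.get_dummies(df, columns=["State"], prefix="State")
print(encoded.columns.tolist())
```

`log1p` is used here instead of a plain log so that listings with zero reviews do not produce negative infinity.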

4.2 Get the final dataset for Tree based models

4.3 Split Train and Test data

5. Regression

5.1 Linear Regression

5.2 Lasso Regression

Lasso Regression has a lower MSE than Linear Regression; the coefficients of the selected features are listed.
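The comparison between the two regressions might look like the sketch below. It runs on synthetic data, and the penalty `alpha=0.1` and the feature names are arbitrary assumptions, so the printed numbers do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
feature_names = [f"feature_{i}" for i in range(6)]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

ols = LinearRegression().fit(X_train_s, y_train)
lasso = Lasso(alpha=0.1).fit(X_train_s, y_train)  # alpha is an assumed value

print("Linear MSE:", mean_squared_error(y_test, ols.predict(X_test_s)))
print("Lasso MSE: ", mean_squared_error(y_test, lasso.predict(X_test_s)))

# Features with a non-zero coefficient are the ones Lasso kept.
selected = {n: c for n, c in zip(feature_names, lasso.coef_) if c != 0.0}
print(selected)
```

In practice the penalty strength would be chosen by cross-validation, for example with scikit-learn's `LassoCV`.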

6. PCR

PCA should be fitted on the training data only, and the fitted transformation then applied to the test data. This prevents information leakage from the test set, which would lead to an overestimate of the model's performance.
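One way to guarantee that PCA only ever sees the training data is to wrap scaling, PCA and the regression in a single scikit-learn pipeline. The sketch below uses synthetic data, and `n_components=5` is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the processed feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Principal component regression: scale -> PCA -> linear regression.
pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())

# fit() learns the scaling, the components and the regression from training data only.
pcr.fit(X_train, y_train)

# predict() reuses the already fitted transforms on the test data.
test_mse = mean_squared_error(y_test, pcr.predict(X_test))
print("PCR test MSE:", test_mse)
```

Because the pipeline is a single estimator, it can also be passed to cross-validation utilities without leaking validation folds into the PCA fit.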

7. Random Forest

7.1 Hyperparameter tuning

7.2 Calculate MSE from test data

7.3 Feature Importance

8. XGBoost

8.1 Hyperparameter tuning

8.2 Calculate MSE from test data

8.3 Feature Importance

9. LightGBM

9.1 Hyperparameter tuning

9.2 Feature Importance: contribution of each feature in the model in terms of the reduction in the MSE of the splits

'Reviews', 'Original Price' and 'Discounted_rate' are important features in terms of the overall reduction in MSE. We will combine these results with SHAP values later to give suggestions to listing owners and customers.
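To illustrate what an importance based on split-wise error reduction looks like, the sketch below swaps in scikit-learn's `GradientBoostingRegressor` instead of LightGBM, so that it runs without extra dependencies. Its `feature_importances_` attribute is an impurity-based importance normalized to sum to 1. The data and feature names are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: three informative features of decreasing strength plus pure noise.
rng = np.random.default_rng(0)
feature_names = ["strong", "medium", "weak", "noise"]  # hypothetical names
X = rng.normal(size=(300, 4))
y = (
    2.0 * X[:, 0]
    + 1.0 * X[:, 1] ** 2
    + 0.5 * X[:, 2]
    + rng.normal(scale=0.1, size=300)
)

model = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X, y)

# Impurity-based importance: each feature's share of the total split improvement.
importance = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name:>8}: {score:.3f}")
```

With LightGBM itself, the corresponding quantity is the `gain` importance type, as opposed to the default split-count importance.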

9.3 SHAP: how much each feature contributes to the output of the model
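The idea behind SHAP can be shown with a brute-force Shapley computation on a tiny model. This sketch is not the SHAP library and makes two simplifying assumptions: the model is a hypothetical three-feature function, and an "absent" feature is replaced by a single fixed baseline value rather than averaged over background data.

```python
from itertools import combinations
from math import factorial


def model(x):
    # Hypothetical toy model with one interaction term between features 0 and 2.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        # Evaluate f with features in `subset` taken from x, the rest from baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)
print(sum(phi), model(x) - model(baseline))
```

The contributions add up to the difference between the prediction at `x` and the prediction at the baseline, which is the property that makes per-feature drill-downs comparable. Enumerating every coalition is only feasible for a handful of features; dedicated explainers for tree models compute these values much more efficiently.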

9.3.1 Drill down to Service_Ratings

  1. 'Service_Ratings_cleanliness' is the most important feature influencing the overall 'Rating'. Suggestion for listing owners: try to raise this rating category to 4.6 or higher.
  2. 'Service_Ratings_value' is the second most important feature that owners should work to increase.
  3. For 'Service_Ratings_accuracy', a high value can influence 'Rating' both positively and negatively, but a low value clearly has a negative impact on 'Rating'. Owners need not put much effort into 'Service_Ratings_accuracy' as long as it is not very low.
  4. 'Location' and 'Checkin' are the aspects people care about least when rating their Airbnb experience.

9.3.2 Drill down to Original Price

Most of the lower ratings come from houses with a low original price. Suggestion for customers: renting rooms priced above $200 is more likely to result in a good experience.

9.3.3 Drill down to Superhost

Rooms whose listing owners are 'Superhost' are more likely to get higher ratings. Listing owners can try to obtain this award, and customers can use it as an important criterion when selecting houses.

9.3.4 Drill down to Reviews

Most of the houses rated higher than 4.9 have fewer than 250 reviews. Suggestion for listing owners: the total number of reviews does not matter. It is fine if customers rate the stay without writing a review.

9.3.5 Drill down to State

Houses in OR are more likely to have bad ratings. Suggestion for customers: read the reviews more carefully when choosing an Airbnb in OR than when choosing one in CA or WA.

9.3.6 Drill down to Amenities_refrigerator

Customers are more likely to give a lower rating when there is a refrigerator in the house. Suggestion for listing owners: check what people say about the refrigerator and improve it. If it is hard to keep the refrigerator working and clean, it is fine not to provide one.

9.3.7 Drill down to Highlights_great checkin experience

People who had a great checkin experience but still rated the stay lower did so mainly because of value and cleanliness. Suggestion for listing owners: put more effort into the overall value and cleanliness of the house, even when customers have a great checkin experience.

10. In conclusion:

For listing owners:

  1. Try to raise the 'Service_Ratings_cleanliness' rating category to 4.6 or higher.
  2. 'Service_Ratings_value' is the second most important feature that owners should work to increase.
  3. Do not put much effort into 'Service_Ratings_accuracy' as long as it is not very low.
  4. 'Location' and 'Checkin' are the aspects people care about least when rating their Airbnb experience.
  5. Try to obtain the 'Superhost' award to get higher ratings.
  6. The total number of reviews does not matter. It is fine if customers rate the stay without writing a review.
  7. Check what people say about the refrigerator and improve it. If it is hard to keep the refrigerator working and clean, it is fine not to provide one.
  8. Put more effort into the overall value and cleanliness of the house, even when customers have a great checkin experience.

For customers:

  1. Renting rooms priced above $200 is more likely to result in a good experience.
  2. Use 'Superhost' as an important criterion when selecting houses.
  3. Read the reviews more carefully when choosing an Airbnb in OR than when choosing one in CA or WA.

11. Model Comparison

Lasso Regression has the lowest MSE among all the models. LightGBM has the lowest MSE among the tree-based models; although it does not beat Lasso, it can handle non-linear relationships and is easy to interpret.